[Datasets] Support tensor columns in `to_tf` and `to_torch`. #24752

Conversation
Nice! Left some comments, mostly for my understanding!
```diff
 )
 df = pd.concat([df1, df2])
 ds = ray.data.from_pandas([df1, df2])
 ds = maybe_pipeline(ds, pipelined)
 torchd = ds.to_torch(label_column="label", batch_size=2)

-num_epochs = 2
+num_epochs = 1 if pipelined else 2
```
Does the pipelined case not work with 2 epochs?
You can't consume a pipeline multiple times; this test never worked with `num_epochs = 2` and pipelining. See `test_to_torch` for this same pattern.
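(For context, a minimal sketch of the single-consumption behavior, assuming the Ray 1.x `Dataset.window()` pipelining API; the exact error type and message are assumptions:)

```python
import ray

ds = ray.data.range(8)
pipe = ds.window(blocks_per_window=2)  # a DatasetPipeline

# First epoch works: the pipeline executes lazily as it is iterated.
for batch in pipe.iter_batches():
    pass

# A second epoch fails, because the pipeline has already been consumed.
try:
    for batch in pipe.iter_batches():
        pass
except RuntimeError as e:
    print(e)
```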
```diff
@@ -246,5 +246,3 @@ This feature currently comes with a few known limitations that we are either act

 * All tensors in a tensor column currently must be the same shape. Please let us know if you require heterogeneous tensor shape for your tensor column! Tracking issue is `here <https://github.com/ray-project/ray/issues/18316>`__.
 * Automatic casting via specifying an override Arrow schema when reading Parquet is blocked by Arrow supporting custom ExtensionType casting kernels. See `issue <https://issues.apache.org/jira/browse/ARROW-5890>`__. An explicit ``tensor_column_schema`` parameter has been added for :func:`read_parquet() <ray.data.read_api.read_parquet>` as a stopgap solution.
-* Ingesting tables with tensor columns into pytorch via ``ds.to_torch()`` is blocked by pytorch supporting tensor creation from objects that implement the `__array__` interface. See `issue <https://github.com/pytorch/pytorch/issues/51156>`__. Workarounds are being `investigated <https://github.com/ray-project/ray/issues/18314>`__.
```
👍 doc updates. Could we also add some explicit examples / text in the docs on how to do this?
Hmm, document how to do what? This should now work transparently with tensor columns, without any special considerations by the user, so there wouldn't be anything special to document at exchange time.
It's still not clear to me, as someone only passably familiar with the tensor code, how exactly I would use it. Could we have a runnable example showing it end to end with a tensor dataset?

Basically whatever's in the unit test, but in a friendlier example form.
Agreed that an e2e example for tensor data is a good idea, and I'm about to work on creating one as part of the Datasets GA docs work in a separate PR. But the point I'm making here is that the user shouldn't have to do anything differently in the `.to_torch()` or `.to_tf()` call just because their data contains tensor columns.

The tensor column --> ML framework tensor conversion will happen automatically, without any tensor-column-specific args to `.to_torch()` or `.to_tf()`, and without special considerations for the upstream tensor column creation other than what's already documented in the tensor column guide: creating the tensor column in the first place. For `.to_tf()`, the tensor spec of the columns needs to be given, but that is already a requirement for all columns.

Since this example work is already planned for before GA, could we merge this in order to unblock the user PoC that depends on it?
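(To illustrate the claim, here's a minimal sketch of passing a tensor-column table through `to_torch()` with no tensor-specific arguments; the column names and `TensorArray` construction are illustrative, not taken from this PR:)

```python
import numpy as np
import pandas as pd
import ray
from ray.data.extensions import TensorArray

# A table with one tensor column and one scalar label column.
df = pd.DataFrame({
    "feature": TensorArray(np.arange(16, dtype=np.float32).reshape(4, 2, 2)),
    "label": [0, 1, 0, 1],
})
ds = ray.data.from_pandas([df])

# The same call as for purely scalar columns; no tensor-specific args needed.
torch_ds = ds.to_torch(label_column="label", batch_size=2, unsqueeze_label_tensor=False)
```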
I don't think the example I'm asking for would be any longer than this thread already is. Let's set a high standard for documentation.
I pushed up an example so we can resolve this. I'll have to rewrite/delete this example early next week once other work is merged, which is one of the reasons I didn't want to block the user PoC on this example, so I'd appreciate it if we could be a bit more pragmatic in the future.
Thanks. In fact, I don't see any harm in blocking this PR on the e2e example. We should treat docs like unit tests, not as a separate "optional" component. There shouldn't really be a separate e2e example effort, for example.

I left a separate comment on making the example here more fleshed out (maybe this is copying work from your other effort).
```python
        isinstance(dtype, pd.api.extensions.ExtensionDtype)
        for dtype in df.dtypes
    ):
        values = np.stack([col.to_numpy() for _, col in df.items()], axis=1)
```
Hmm, do we want to convert all columns to NumPy if any column is an ExtensionDtype? Or should we only do `col.to_numpy()` if the dtype of that particular column is an ExtensionDtype?
It doesn't matter; `df.values` converts all columns to NumPy ndarrays anyway, so it would be exactly the same.
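(A quick sketch of the equivalence being described, for plain NumPy-dtype columns with no extension dtypes:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# df.values materializes every column into one 2D NumPy array, so stacking
# the per-column ndarrays by hand produces the same result.
stacked = np.stack([col.to_numpy() for _, col in df.items()], axis=1)
assert np.array_equal(df.values, stacked)
```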
```
for the tensor columns, including the shape of each underlying tensor element in said
column.

.. code-block:: python
```
This isn't really e2e. Could we start with an example dataset generated via `ray.data.range_tensor()` (maybe using `add_column` to add extra cols), and follow with a `DataLoader` to read the data afterwards?
Approved, but I'd prefer we always ship with e2e examples.
BTW, just to add some color about end-to-end examples: I think there's a bit of a miscommunication; let me try to elaborate on what @ericl is trying to say. Clark, from what I understand, you are going to build an E2E example very soon, so Eric is seemingly asking for redundant work. But I think what Eric is asking for is more of a runnable example with more context in the docstrings. From the user perspective, I think there are two things that are hard to understand about the current snippets in the tensor user guide:
(This is a bit of a departure from the current way of writing docs, I know. But I think we can start upping the standards, especially with our new focus on building out great docs. Also, these esoteric minimal snippets are a recent major pain point I've had with Serve, and I'm a bit scarred from that experience ;) )
@ericl @richardliaw I completely agree that this should be the desired end-state! What I'm pushing back against is expanding the scope of this narrow PR to include docs work that is already in progress in other branches/PRs and has already been separately allocated. To be clear, the data team is already working on each of the items y'all have mentioned, just not in this PR.

I'd rather have discussions on those e2e examples in those upcoming PRs than in this PR, which is narrowly focused on unblocking basic use of the ML framework integrations with tensor data.
I think the confusion lies in exactly how much expanded scope we're talking about here (the word "end-to-end" may have conflated connotations). You'll make me happy with just a small 5-10 line addendum (which I tried to do below, btw, but I really couldn't get it to work):

```diff
diff --git a/doc/source/data/dataset-tensor-support.rst b/doc/source/data/dataset-tensor-support.rst
index 6c6baea07..a268b4cd6 100644
--- a/doc/source/data/dataset-tensor-support.rst
+++ b/doc/source/data/dataset-tensor-support.rst
@@ -161,12 +161,22 @@ automatically converted to a Torch/TensorFlow tensor without incurring any copie
 .. code-block:: python
 
+    import ray
+    import torch
+
+    ds = ray.data.range_tensor(10000, shape=(3, 5))
+
     # Convert the dataset to a Torch IterableDataset.
     torch_ds = ds.to_torch(
         label_column="label",
         batch_size=4,
         unsqueeze_label_tensor=False,
     )
+    dataloader = torch.utils.data.DataLoader(torch_ds, num_workers=0)
+
+    for x, y in dataloader:
+        # train model(x, y)
+
     # Convert the dataset to a TensorFlow Dataset.
     tensor_element_shape = (32, 32)
```
The logic from my perspective is the following: a feature PR should be self-contained with the following things:

Obviously there are different degrees of investment for 1, 2, and 3. But, on the other hand, you still need comprehensive unit tests to verify correctness (which you've done a great job of). This PR does not provide enough documentation for even the proficient user to use it successfully (I tried and failed...).
@richardliaw Again, agreed on those standards for PRs; the data team is in the middle of a massive docs push in order to create complete docs coverage at that standard, and we'll be strictly adhering to that standard post-GA. In order to unblock this PR and resolve this discussion, I'll add an e2e example to the tensor feature guide. But in case the motivation I gave above was not clear, I was trying to avoid adding this e2e example to the tensor feature guide, since it will end up being redundant work that's obviated in the next day or two:

Either of the parallel work items in (1) or (2) will make this e2e example in the tensor data feature guide obsolete, and I was aiming to quickly get this bug fix to a user doing a PoC, so I was hoping that we could be pragmatic here. Anyways, I'll add an e2e example to get this unblocked.
@richardliaw As for your e2e example attempt, did you check out the […]?

You're also generating a single […]. I'm going to be reworking the […]. I'm guessing that an e2e example in the docstring and the feature guide will make this all much more clear, which is why we have a dedicated docs task for adding these! 😄
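(For readers following along, here's a hedged sketch of what a working version of that addendum might look like. It assumes that a pure-tensor dataset from `range_tensor()` has no "label" column, and that `to_torch()` yields a `None` label when `label_column` is omitted; this is a reconstruction, not the example that was ultimately added:)

```python
import ray
import torch

# range_tensor() produces a dataset with a single tensor column.
ds = ray.data.range_tensor(1000, shape=(3, 5))

# No "label" column exists here, so omit label_column entirely.
torch_ds = ds.to_torch(batch_size=4)

# batch_size=None: to_torch() already batches, so don't re-batch in the DataLoader.
dataloader = torch.utils.data.DataLoader(torch_ds, batch_size=None, num_workers=0)

for features, label in dataloader:
    # label is None since no label_column was given; train on features here.
    break
```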
@ericl @richardliaw @amogkam Any final feedback on the example or otherwise? If not, I'm planning on merging this today.
This PR adds support for tensor columns in the `to_tf()` and `to_torch()` APIs.

For Torch, this involves an explicit extension array check and (zero-copy) conversion of the tensor column to a NumPy array before converting the column to a Torch tensor.

For TensorFlow, this involves bypassing `df.values` when converting tensor feature columns to NumPy arrays, instead manually creating a single NumPy array from the column Series.

In both cases, I think that the UX around heterogeneous feature columns and squeezing the column dimension could be improved, but I'm saving that for a future PR.
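(A minimal sketch of the TensorFlow-side idea described above: building the feature array from the tensor column's Series directly rather than via `df.values`. The column name and data are illustrative, and it assumes the `TensorArray` extension type's `to_numpy()` returns the underlying `(n, *shape)` ndarray:)

```python
import numpy as np
import pandas as pd
from ray.data.extensions import TensorArray

df = pd.DataFrame({"feature": TensorArray(np.arange(8, dtype=np.float32).reshape(4, 2))})

# df.values doesn't reliably preserve the per-element tensor shape for extension
# columns, whereas converting the column Series directly keeps it intact.
features = np.asarray(df["feature"].to_numpy())
print(features.shape)  # (4, 2)
```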
Related issue number
Closes #18314, #18315
Checks

- [x] I've run `scripts/format.sh` to lint the changes in this PR.